Ford GoBike Data Exploration

Preliminary Wrangling

This document explores the "Ford GoBike System Data" dataset, which includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay Area.

What is the structure of your dataset?

There are 183,412 trips in the dataset with 16 features (duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender, and bike_share_for_all_trip). Most variables are numeric in nature, but the variables start_station_name, end_station_name, bike_share_for_all_trip, user_type, and member_gender are qualitative.

What is/are the main feature(s) of interest in your dataset?

I'm most interested in figuring out what features are best for predicting the duration of the trips in the dataset.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect that the distance between the start station and the end station will have the strongest effect on each trip's duration: the larger the distance, the longer the duration. I also think that user_type, member_age, member_gender, and the period of the trip will have effects on the duration.

There is some data wrangling needed:

  • there are some missing values in member_gender, member_birth_year, start_station_id and end_station_id. Remove these rows.
  • there are some incorrect values, for example people who were born in 1878. There are also some age outliers: people born before 1920. These rows will be removed as well.
  • member_birth_year should not be a float. Convert to integer type.
  • start_time and end_time should be converted into datetime data type.
  • create a column member_age, based on member_birth_year. I'm interested in knowing whether age affects the duration of the trips.
  • create a new column duration_minute. It could be very useful.
  • new columns for day of week, month and hour will be created for better insight into the data.
  • create a column distance between start station and end station based on coordinates.
  • user_type and member_gender should be categories. Convert to category type.
  • drop unnecessary columns for the analysis.
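The wrangling steps listed above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up sample, not the notebook's actual code. The helper names haversine_km and wrangle are hypothetical, the haversine formula is one assumed way to derive distance from coordinates, and member_age is assumed to be measured against the year of each trip. No columns are dropped here, since the list does not say which ones are unnecessary.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample using the raw column names; values are illustrative only.
raw = pd.DataFrame({
    "duration_sec": [600, 1200, 300, 900],
    "start_time": ["2019-02-28 17:32:10", "2019-02-28 08:05:00",
                   "2019-02-27 12:00:00", "2019-02-26 09:15:00"],
    "end_time": ["2019-02-28 17:42:10", "2019-02-28 08:25:00",
                 "2019-02-27 12:05:00", "2019-02-26 09:30:00"],
    "start_station_id": [21.0, 23.0, None, 86.0],
    "end_station_id": [13.0, 81.0, 3.0, 3.0],
    "start_station_latitude": [37.7896, 37.7915, 37.7693, 37.7693],
    "start_station_longitude": [-122.4008, -122.3990, -122.4268, -122.4268],
    "end_station_latitude": [37.7942, 37.7757, 37.7864, 37.7864],
    "end_station_longitude": [-122.4029, -122.3932, -122.4048, -122.4048],
    "user_type": ["Customer", "Subscriber", "Subscriber", "Subscriber"],
    "member_birth_year": [1984.0, 1878.0, 1990.0, 1972.0],
    "member_gender": ["Male", "Female", "Male", "Other"],
})


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (assumed distance definition)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def wrangle(df, min_birth_year=1920):
    """Apply the listed cleaning steps and return a new DataFrame."""
    # Remove rows with missing member or station information.
    df = df.dropna(subset=["member_gender", "member_birth_year",
                           "start_station_id", "end_station_id"]).copy()
    # Remove implausible birth years (born before 1920).
    df = df[df["member_birth_year"] >= min_birth_year].copy()
    df["member_birth_year"] = df["member_birth_year"].astype(int)
    # Parse timestamps.
    df["start_time"] = pd.to_datetime(df["start_time"])
    df["end_time"] = pd.to_datetime(df["end_time"])
    # Derived columns; age is relative to the trip year (an assumption).
    df["member_age"] = df["start_time"].dt.year - df["member_birth_year"]
    df["duration_minute"] = df["duration_sec"] / 60
    df["start_day_of_week"] = df["start_time"].dt.day_name()
    df["start_month"] = df["start_time"].dt.month_name()
    df["start_hour"] = df["start_time"].dt.hour
    df["distance"] = haversine_km(
        df["start_station_latitude"], df["start_station_longitude"],
        df["end_station_latitude"], df["end_station_longitude"])
    # Store the qualitative member columns as categories.
    for col in ("user_type", "member_gender"):
        df[col] = df[col].astype("category")
    return df.reset_index(drop=True)


clean = wrangle(raw)
```

In this sample, the row with the 1878 birth year and the row with a missing start_station_id are dropped, leaving two trips.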

Data Wrangling

What is the structure of your dataset after wrangling the original one?

There are 174,877 rides in the dataset with 14 features (duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender, and bike_share_for_all_trip). Most variables are numeric in nature, but the variables start_station_name, end_station_name, bike_share_for_all_trip, user_type, and member_gender are qualitative.

Univariate Exploration

I'll start by looking at the distribution of the main variable of interest: duration_minute.

There are some outliers. For this reason, we are going to check the duration of the majority of trips using the 99th percentile.

The majority of the trips (99%) have a duration shorter than 53 minutes. Let's adjust the bins and ticks.

Let's remove the outliers from the cleaned dataset.
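As a rough sketch of this kind of percentile-based trimming, the snippet below computes a 99th-percentile cutoff and keeps only the rows at or below it. The helper trim_above_percentile is a hypothetical name, the data is randomly generated rather than taken from the GoBike dataset, and keeping values equal to the cutoff is an assumption.

```python
import numpy as np
import pandas as pd


def trim_above_percentile(df, column, q=0.99):
    """Return (rows with column <= the q-th quantile, the cutoff value)."""
    cutoff = df[column].quantile(q)
    return df[df[column] <= cutoff].copy(), cutoff


# Made-up, long-tailed durations in minutes (illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"duration_minute": rng.lognormal(mean=2.3, sigma=0.7, size=1000)})

trimmed, cutoff = trim_above_percentile(demo, "duration_minute")
print(f"cutoff: {cutoff:.1f} min, kept {len(trimmed)} of {len(demo)} rows")
```

The same helper could be applied to other numeric columns, such as distance or member_age, by changing the column argument.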

Duration has a long-tailed distribution, with a lot of trips on the low duration end, and few on the high duration end. When plotted on a log-scale, the duration distribution looks unimodal, with the highest peak around 10 minutes.
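One way to look at a long-tailed variable on a log scale is to build bin edges that are evenly spaced in log10 units. The sketch below does this with NumPy on made-up durations; the bin width of 0.1 decades and the generated data are assumptions for illustration only.

```python
import numpy as np

# Made-up, long-tailed durations in minutes (illustration only).
rng = np.random.default_rng(1)
durations = rng.lognormal(mean=np.log(10), sigma=0.6, size=5000)

# Bin edges evenly spaced in log10 units, 0.1 decades wide, covering the data.
lo = np.floor(np.log10(durations.min()))
hi = np.ceil(np.log10(durations.max()))
edges = np.logspace(lo, hi, num=int(round((hi - lo) * 10)) + 1)

counts, edges = np.histogram(durations, bins=edges)
peak = counts.argmax()
print(f"most populated bin: {edges[peak]:.1f} to {edges[peak + 1]:.1f} minutes")
```

With matplotlib, the same edges can be passed to plt.hist(durations, bins=edges) followed by plt.xscale("log") to draw the histogram on a log-scaled axis.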

Next up, the first predictor variable of interest: distance.

There are some outliers related to distance too. For this reason, we are going to check the distance of the majority of trips using the 99th percentile.

The majority of the trips (99%) have a distance shorter than 6 kilometers. There are also some outlier points with a distance of 0 kilometers. Let's adjust the bins.

Let's remove the outliers related to distance too.

The distribution of trip distance is right-skewed and unimodal, with the most frequent distances between 1 and 2 kilometers.

Next up, the second predictor variable of interest: member_age.

There are some outliers related to member_age. Let's identify them.

The majority of the trips (99%) are made by members younger than 65 years old. Let's remove the outliers and plot the distribution again.

The distribution of member age is right-skewed and unimodal. The most frequent member ages are between 25 and 45 years old.

I'll now move on to the categorical variables related to the start period of the trip in the dataset: start_month, start_day_of_week, and start_hour.
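For period variables such as start_day_of_week, counts and plots read more naturally when the categories follow calendar order instead of alphabetical order. A minimal sketch, assuming the column holds English day names (the sample values are made up):

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
day_dtype = pd.CategoricalDtype(categories=day_order, ordered=True)

# Made-up sample of trip start days (illustration only).
trips = pd.DataFrame(
    {"start_day_of_week": ["Thursday", "Monday", "Sunday",
                           "Thursday", "Tuesday"]})
trips["start_day_of_week"] = trips["start_day_of_week"].astype(day_dtype)

# Sorting by index now follows the calendar order defined above.
counts = trips["start_day_of_week"].value_counts().sort_index()
print(counts)
```

The same approach would apply to start_month with a list of month names in calendar order.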

I'll now move on to the categorical variables related to the member characteristics in the dataset: user_type, and member_gender.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The duration_minute variable took on a large range of values, and I detected outliers. Since 99% of trips had a duration lower than 60 minutes, I decided to remove the longer trips from the dataset to be on the safe side. I also looked at the data using a log transform, under which the distribution looked unimodal, with a peak around 10 minutes.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When investigating the distance and member_age variables, I identified a number of outlier points in both. I decided to remove these outliers from the dataset, to be on the safe side, before moving forward. As part of the data wrangling, I also converted the start_day_of_week and start_month variables to ordered categorical types for cleaner visualizations.

Bivariate Exploration

To start off with, I want to look at the pairwise correlations present between features in the data.
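A pairwise correlation table for the numeric features can be computed with DataFrame.corr, which uses the Pearson coefficient by default. The sketch below runs on randomly generated stand-in data, so its numbers say nothing about the real dataset; only the column names come from this analysis.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data (illustration only, not the GoBike data).
rng = np.random.default_rng(2)
n = 500
distance = rng.uniform(0.3, 6.0, n)
duration = (4 * distance + rng.normal(0, 5, n)).clip(min=1)
age = rng.integers(18, 65, n)

trips = pd.DataFrame({
    "duration_minute": duration,
    "distance": distance,
    "member_age": age,
})

# Pearson correlation between every pair of numeric columns.
corr = trips.corr()
print(corr.round(2))
```

The resulting matrix is symmetric with ones on the diagonal, and it can be passed to a heatmap function for a visual summary.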

As expected, 'duration_minute' and 'distance' are positively correlated: trips that last longer tend to cover a longer distance as well. However, I expected a stronger correlation than the one observed. On the other hand, 'member_age' isn't correlated with either 'duration_minute' or 'distance'.

Let's move on to looking at how duration, distance and member age relate to the categorical variables describing member characteristics (user_type and member_gender).
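A numeric companion to such a comparison is a grouped summary, for example the median of each numeric variable per category. The sketch below uses a handful of made-up rows, so the values are illustrative only.

```python
import pandas as pd

# Made-up rows (illustration only).
trips = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber",
                  "Customer", "Customer", "Customer"],
    "member_gender": ["Male", "Female", "Male",
                      "Female", "Male", "Female"],
    "duration_minute": [8.0, 10.0, 12.0, 15.0, 20.0, 25.0],
    "distance": [1.2, 1.5, 1.8, 1.9, 2.4, 2.6],
    "member_age": [36, 41, 33, 29, 31, 27],
})

numeric_cols = ["duration_minute", "distance", "member_age"]

# Median of each numeric variable within each category level.
by_user = trips.groupby("user_type")[numeric_cols].median()
by_gender = trips.groupby("member_gender")[numeric_cols].median()
print(by_user)
print(by_gender)
```

Medians are used here because they are less sensitive to any remaining long-tail values than means.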

There are some relationships between the member categorical variables and the numeric variables of interest. Subscribers made shorter trips than customers, both in duration and in distance. Subscribers also seem to be older than customers. Regarding member gender, the distributions are very similar: females made slightly longer trips than males, while males seem to be slightly older than females.

Finally, let's look at relationships between user_type and trip period categorical features.
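Relationships between two categorical features can be tabulated with pd.crosstab. The sketch below, on made-up rows, counts trips per user type and start hour, and also shows each user type's share of trips per day of the week via normalize="index".

```python
import pandas as pd

# Made-up rows (illustration only).
trips = pd.DataFrame({
    "user_type": ["Subscriber"] * 4 + ["Customer"] * 3,
    "start_day_of_week": ["Monday", "Thursday", "Thursday", "Saturday",
                          "Saturday", "Sunday", "Thursday"],
    "start_hour": [8, 17, 8, 11, 13, 15, 17],
})

# Share of each user type's trips falling on each day (rows sum to 1).
by_day = pd.crosstab(trips["user_type"], trips["start_day_of_week"],
                     normalize="index")

# Raw trip counts per user type and start hour.
by_hour = pd.crosstab(trips["user_type"], trips["start_hour"])
print(by_day.round(2))
print(by_hour)
```

Normalizing by row makes the two user types comparable even when one group has far more trips than the other.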

Subscribers made their trips mainly on weekdays, with a peak on Thursdays. Customers' trips, however, are distributed fairly evenly across the whole week.

Subscribers' trips are most frequent from 7am to 9am and from 4pm to 6pm, which correspond to commute times. Customers' trips are spread mainly between 7am and 7pm, with a similar frequency throughout. There is a peak at 5pm, but it doesn't seem that customers use the service for commuting to work.

With the preliminary look at bivariate relationships out of the way, I want to dig into some of the relationships more. First, I want to compare how duration and distance are related for all of the data.

This plot suggests a positive correlation between trip duration and trip distance: as the trip duration increases, the trip distance tends to increase too.

The violin plot of the full data reveals more or less the same as the earlier box plots. It strongly suggests that subscribers made shorter trips than customers in terms of both duration and distance. Regarding member gender, the plot shows no clear relationship with either trip duration or trip distance.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Duration was positively correlated with distance. However, the correlation isn't very strong.

There was also an interesting relationship between duration and distance and the categorical feature user_type: subscribers made shorter trips than customers on both measures. Member gender doesn't have any clear relationship with distance or duration.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are relationships between the user_type and the start_day_of_week and start_hour. While subscribers use the service mainly for commuting to work, customers use the service across the full week, some of them for commuting as well but some of them for leisure reasons.

Multivariate Exploration

The main thing I want to explore in this part of the analysis is how the type of user and the categorical variables related to the start of the trip (start_day_of_week and start_hour) play into trip duration.
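A compact way to combine three variables is a pivot table of mean duration, with one row per day and one column per user type. The rows below are made up for illustration, and using the mean as the summary statistic is an assumption.

```python
import pandas as pd

# Made-up rows (illustration only).
trips = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber",
                  "Customer", "Customer", "Customer"],
    "start_day_of_week": ["Monday", "Monday", "Saturday",
                          "Monday", "Saturday", "Saturday"],
    "duration_minute": [10.0, 12.0, 11.0, 18.0, 30.0, 26.0],
})

# Mean trip duration for every (day, user type) combination.
mean_duration = trips.pivot_table(index="start_day_of_week",
                                  columns="user_type",
                                  values="duration_minute",
                                  aggfunc="mean")

# Difference between the two user types on each day.
gap = mean_duration["Customer"] - mean_duration["Subscriber"]
print(mean_duration)
print(gap)
```

A table like this can feed a heatmap or a grouped point plot, and the same pattern works with start_hour as the index.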

Subscribers have a similar trip duration on every day of the week. Customers made longer trips than subscribers, and this difference is especially large on weekends.

Subscribers use the service mainly for commuting to work, since most of their trips are made around 8am and 5pm, which correspond to office hours. Some customers use the service for commuting as well. The number of subscribers who use the service over the weekend is higher than the number of customers who do. However, the great majority of customers use the service over the weekend.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I extended the investigation into how the type of user and the categorical variables related to the start of the trip (start_day_of_week and start_hour) play into trip duration. Looking at these features together helped to strengthen the relationships seen earlier. The majority of subscribers use the service mainly on weekdays, as their way to commute to the office. Customers also use the service for commuting to the office, but the majority of them use it during weekends from 10am to 6pm. However, the number of subscribers who use the service on weekends is greater than the number of customers who do.

Were there any interesting or surprising interactions between features?

It's interesting how each type of user behaves differently in the way they use the service, and the resulting relationship with trip duration makes sense.